Online Learning in Markov Decision Processes with Changing Cost Sequences
Authors
Abstract
In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit information. We view this problem as an instance of online linear optimization and propose two methods for it: MD2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques together with near-optimal regret bounds (in particular, we take into account the computational costs of performing approximate projections in MD2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback, we consider the online stochastic shortest path problem, a special case of the above MDP problems, and improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
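To make the online linear optimization view concrete, here is a minimal sketch of mirror descent with the negative-entropy regularizer (the exponentiated-gradient update) under full-information feedback. It is an illustration under simplifying assumptions, not the MD2 method from the abstract: the decision set is taken to be the probability simplex, the projection is an exact normalization rather than an approximate one, and the names `exponentiated_gradient`, `costs`, and `eta` are hypothetical.

```python
import numpy as np

def exponentiated_gradient(costs, eta):
    """Entropic mirror descent on the probability simplex (illustrative only).

    costs: array of shape (T, d); row t is the cost vector revealed after round t.
    eta:   step size (a hypothetical tuning parameter).
    Returns the plays, shape (T, d), and the regret against the best fixed coordinate.
    """
    costs = np.asarray(costs, dtype=float)
    T, d = costs.shape
    x = np.full(d, 1.0 / d)              # start from the uniform distribution
    plays = np.empty((T, d))
    for t in range(T):
        plays[t] = x                     # commit to x before seeing costs[t]
        w = x * np.exp(-eta * costs[t])  # multiplicative (mirror) step
        x = w / w.sum()                  # exact normalization onto the simplex
    learner_cost = float(np.sum(plays * costs))
    best_fixed_cost = float(costs.sum(axis=0).min())
    return plays, learner_cost - best_fixed_cost

# Usage: random costs in [0, 1]; the step size is the standard tuning for that range.
rng = np.random.default_rng(0)
T, d = 1000, 5
costs = rng.uniform(0.0, 1.0, size=(T, d))
eta = np.sqrt(8.0 * np.log(d) / T)
plays, regret = exponentiated_gradient(costs, eta)
```

With costs in [0, 1] and this step size, the classical guarantee for the update is a regret of at most sqrt(T ln(d) / 2) on any cost sequence, which is what makes the linear-optimization reduction attractive.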
Similar resources
Utilizing Generalized Learning Automata for Finding Optimal Policies in MMDPs
Multi-agent Markov decision processes (MMDPs), as the generalization of Markov decision processes to the multi-agent case, have long been used for modeling multi-agent systems and serve as a suitable framework for multi-agent reinforcement learning. In this paper, a generalized learning automata based algorithm for finding optimal policies in MMDPs is proposed. In the proposed algorithm, MMDP ...
Fast rates for online learning in Linearly Solvable Markov Decision Processes
We study the problem of online learning in a class of Markov decision processes known as linearly solvable MDPs. In the stationary version of this problem, a learner interacts with its environment by directly controlling the state transitions, attempting to balance a fixed state-dependent cost and a certain smooth cost penalizing extreme control inputs. In the current paper, we consider an onli...
Relax but stay in control: from value to algorithms for online Markov decision processes
Online learning algorithms are designed to perform in non-stationary environments, but generally there is no notion of a dynamic state to model constraints on current and future actions as a function of past actions. State-based models are common in stochastic control settings, but commonly used frameworks such as Markov Decision Processes (MDPs) assume a known stationary environment. In recent ...
Online Learning in Stochastic Games and Markov Decision Processes
In their standard formulations, stochastic games and Markov decision processes assume a rational opponent or a stationary environment. Online learning algorithms can adapt to arbitrary opponents and non-stationary environments, but do not incorporate the dynamic structure of stochastic games or Markov decision processes. We survey recent approaches that apply online learning to dynamic environm...
Title of dissertation: LEARNING ALGORITHMS FOR MARKOV DECISION PROCESSES
Title of dissertation: LEARNING ALGORITHMS FOR MARKOV DECISION PROCESSES. Abraham Thomas, Doctor of Philosophy, 2009. Dissertation directed by: Professor Steven Marcus, Department of Electrical and Computer Engineering. We propose various computational schemes for solving Partially Observable Markov Decision Processes with the finite stage additive cost and infinite horizon discounted cost criterio...
Journal title:
Volume, issue:
Pages: -
Publication date: 2014